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Abstract 

The effect of using past queries to im prove automatic 
query expansion was examined in the TREC environment. 
Automatic feedback of documents identified from similar 
past queries was compared with standard top 'document feed- 
back and with no feedback. A new query similarity metric 
was used based on comparing result lists and using prob- 
ability of relevance. Our top-document feedback method 
showed small improvements over no feedback method consis- 
tent with past studies. On recall-precision and average pre- 
cision measures, past query feedback yielded performance 
superior to that of top-document feedback. The past query 
feedback method also lends itself to tunable thresholds such 
that better performance can be obtained by automatically 
deciding when, and when not, to apply the expansion. Auto- 
matic past-query feedback actually improved top document 
precision in this experiment. 

1 Introduction 

Mult i- gigabyte databases are de retrueur today and the state 
of the art is pushing beyond terabyte sizes. Yet wide-spread 
experience with public and private search systems tells us 
that searchers are either unwilling or unable to invest in 
query construction. Over a wide range of operational en- 
vironments, the average number of searcher-supplied query 
terms is frequently less than 2 and rarely more than 4. At 
the same time, the best performers in the TREC studies 
created queries with nearly 100 terms ([3], [1]). Clearly the 
developers of these systems believe that more terms lead to 
better effectiveness. It does seem intuitively obvious that as 
the data volumes increase, the number of search terms must 
also increase to provide the additional context necessary for 
effective retrieval - i.e., regardless of the engine ranking, how 
can 2 terms sift lOOGBytes of data? 1 

One approach to this problem is to rely more heavily on 
automatic query expansion. Historically, automatic expan- 
sion has been viewed as operationally impractical because, 

1 Except, of course, for that class of high precision queries where a 
few words can Identify exactly the need. 
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while it has helped recall, it generally hurts precision at 
the top of the result list. We expect that as extraction tech- 
niques and our ability to estimate or infer relevance improve, 
automatic expansion methods should help offset the prob- 
lems related to the paucity of user-supplied query terms and 
relevance judgments. 

It is appealing to think that the results of past searches 
may be a source of additional evidence to help future searches. 
In operational environments, close examination of query logs 
clearly indicates the overlap in topic interests between search- 
es. For example, the top 10 queries against the Open Text 
Web. Index (http://www.opentext.com) often account for 
more than 5% of all queries. Additionally, external events 
create database Hot spots. For example, during the heavy 
US East Coast snowstorm of January 1996, the Open Text 
Web Index experienced a sudden, profound, and sustained 
burst of activity due to the query weather, which prior to the 
storm had been rare. While these observations only consider 
queries that are identical, there is also great topical overlap 
between non- identical queries. A quick sample of daily query 
logs for the Open Text Index shows that between 5 and 9 of 
the top 10 most frequently occurring queries are all about 
one very specific and popular topic! 

In this paper we describe experiments using past queries 
as a source of evidence for automatic term expansion. The 
hypothesis is that the top documents selected by a query 
from a pool of documents that are themselves the top docu- 
ments from result lists of past similar queries are a good 
source for automatic term expansion. We compared the 
performance of this method against the unexpended base- 
line queries and against the baseline queries expanded with 
top-document feedback ([4}). We created a query similarity 
metric that used empirically derived probability of relevance 
information. We also evaluated the use of a threshold to con- 
trol, on a per query basis, whether or not to apply automatic 
expansion. In the following sections we describe the method- 
ology used, including the query similarity metric. Then we 
examine the results, review related efforts, summarize and 
discuss future directions. 

2 Description of the Feedback Methods 

The experiments were conducted with the TREC- 5 data set 
[9]. The set of evaluation queries were the 50 ad hoc TREC- 5 
topics and their relevance judgments. For past queries feed- 
back, we also used the 50 ad hoc TREC-3 (7] and 10 of the 
ad hoc TREC-4 queries (8] as the past queries. From the 50 
ad hoc TREC-5 queries, baseline queries were created man- 
ually from the topics and were expressed in a simple natural 
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query Language with stem expansion. No feedback, data 
ot knowledge resources were used to generate the baseline 
queries. The average number of terms per baseline query 
was about 5. The small number of terms reflected the fact 
that we used no manual augmentation resources to discover 
additional terms - consistent with real users in the opera- 
tional environments with which we are familiar. Manually 
created baseline queries presumably yielded a higher perfor- 
mance than automatically generated baseline queries would 
have, but this shouldn't effect the comparison of the two 
automatic feedback methods with each other. 

The search system was the commercial off-the-shelf Open 
Text 6.0 Livelink OEM search engine. 2 A run of the base- 
line queries served as a benchmark against which the other 
methods were compared . 

The complete TREC-5 data set was used for the exper- 
iment. It consists of 524929 documents from TREC disk- 2 
and disk -4, about 2 GBytes of text. Document length varies 
from collection to collection, with an average of 550 terms 
peT document. We used the standard TREC performance 
analysis methods. Average precision, exact precision, recall- 
precision curve and document-precision curve are defined by 
Harm an [9). 

To provide a framework for the evaluation of different 
automatic query expansion methods, and in particular the 
effect of past queries, we posit a simple mode] for automatic 
query expansion. This model (see Figure 1) isolates the doc- 
uments that are candidates for feedback in an affinity pool 
Query expansion is done by evaluating tbe query against the 
affinity pool and extracting terms from the top documents 
retrieved there. The extracted terms are used to augment 
the query, then the query is executed against the main cor- 
pus. 

Creation of the affinity pool seems overly complex for 
simple automatic feedback methods, such as top document 
feedback. However, it has allowed us to experiment with 
different methods for the creation of a query context. In 
these experiments, we also used the results of past queries 
to generate this pool, but in other configurations we've used 
external sources such as documents from web- browser his- 
tory files. This is similar to creating context by watching 
information consuming habits of users, such as reported by 
Maes (10]. Goker and McCmskey (6J. and Morita and Shin- 
oda [M]. 

The top-document and past queries feedback methods 
tested with this model are described below. 

2.1 Top-document feedback 

An automatic top-document feedback 3 run was created for 
the test queries. This method is virtually identical to the 
method described by Buckley, et al.[4j wherein the query is 
first executed against the corpus, the result set is ranked, 
some number of top documents are selected, key terms are 
extracted from the top documents, added to the query, and 
the query is re-executed. In terms of the feedback model 
described above, the affinity corpus is simply the top 200 
documents from the relevance-ranked result list of the base- 
line query, so top documents extracted from it are the same 
as the baseline query's top documents. 

The terms that expand the. query are extracted from the 
top documents using the standard abstract operator in the 

a The ranking aJgorilhm was a straight for ward probabilif tic rele- 
vance ranking algorithm, one of the standard ranking methods avail- 
Able in the system. 

3 without the u« of human -fen erated re I cvanrp judgments 




Figure 1: A simple model of automatic query expansion. 

search engine. 4 The number of documents selected for term 
expansion and the number of terms extracted from those 
documents can be selected at run-time. In this experiment, 
the top 2 documents were used to select the top 10 terms for 
expansion. The extracted terms were added to the baseline 
query and then the expanded query was run against the full 
corpus. 

2.2 Past-query feedback 

Post-queries feedback required a set of queries to use for 
generating the affinity pool for each query. The past queries 
consisted of the 50 ad hoc TREC-5 queries, 50 ad hoc TREC- 
3 queries, and 10 ad hoc TREC-4 queries. 5 These 110 
queries were run against the TREC-5 data to generate re- 
sult lists, relevance ranked, then truncated after 200 and 
saved. These were then used as input to form an affinity 
pool for each query. 

To create an affinity pool for a query, we first computed 
the 3 queries most similar to it from the 109 other queries 
that exceeded a similarity threshold. Tf a query had 3 close 
matches then the affinity pool was created by pooling the 
top 100 documents from each of the matching past queries' 
result lists. A way to conceptualize this is: since the docu- 
ments retrieved by similar past queries that end up in the 
affinity pool are from the same corpus as the final result 
documents, in effect, the past queries affinity pool is a mask 
that prevents baseline-query top documents that are not also 
past-query top documents from being used for feedback. 

Feedback terms were the top ranked terms computed by 
the abstract operator when applied to the top documents 
from the result list of the baseline query executed against the 

*The abstract operator performs key term identification using a 
straightforward tf X id/ term-scoring algorithm similar to that de- 
scribed in (4). 

5 The TREC-3 and TREC-4 queries were created manually u.iing 
the same method aa for the TREC-5 test queries Only 10 of the 
TREC-4 topici were used because these are what we had available 
from another experiment. 
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affinity pool. Once the baseline query was augmented with 
feedback terms, it was executed against the full corpus. The 
same constants for number of top documents and number of 
top terms were used as for the top-document feedback run. 

The number of queries for which we could compute an 
affinity pool were > of course, dependent upon the threshold- 
For those queries that had no affinity pool, because there 
were less than 3 past queries similar enough to exceed the 
threshold, we used the baseline queries as part of the run. 
That is, no augmentation was done for some of the queries in 
the past-queries feedback run. We did this to make compar- 
isons between the baseline and past queries feedback easier. 

Two issues warrant further discussion. First the method 
of assessing inter- query similarity, and the threshold for de- 
termination of similar queries. 

2.3 Measuring Inter-query Similarity 

The query similarity method compared two queries by ex- 
amining the overlap between their result lists. Essentially, 
it is a weighted overlap measure for the top 200 documents 
of each result list. This is somewhat similar to the method 
used by Raghavan, et al.[12], but without the use of rele- 
vance judgments and extended to take into account empir- 
ically derived weights based on the probability of relevance 
of each matching hit. The weighting gave more preference 
to matches that occurred near the beginning of the lists, 
because the probability of finding a relevant document in a 
ranked result list is greatest at the beginning and decreases 
towards the tail. 

The similarity measure is described by Equation 1. L 
and L are result lists to be compared. D% is the \ lh element 
in the result list, n is the number of elements in the result 
list. Pos{L } Di) is a function that maps an element in the 
result list to its position in the result list L. wgt(P) is a 
weight function for different positions in the result list. The 
weight for a position in the result list is determined by the 
probability of finding a relevant document in that position, 
if a document does not appear in the result list, the weight 
is 0. 

In this experiment, a large-grained partition of the result 
lists was used to produce the position mapping. The weight 
for each position was empirically derived from the relevance 
ranked result lists of 50 TREC-3 ad hoc queries against the 
TREC-3 data collection. Table 1 shows the mapping from 
an element in the result list to its position in the result list, 
and the probability of rinding a relevant document in that 
position . For example, the 10** document in the result list 
will be mapped to position 0, which has a probability of 33% 
to be relevant. 

Closer examination of the result lists showed that much 
greater discrimination was possible by partitioning the result 
list more finely (Table 2). Though not evaluated, we expect 
this partition to produce a better similarity measure. 

2.4 Query Similarity Thresholds 

The past- queries feedback method is sensitive to whether or 
not there are any queries similar enough to the one being 
expanded to be used in generating the affinity pool. To 
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Probability 


. 0 




0.33 
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100-199 


0.10 
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not exist 


0.00 



Table 1 The mapping from ranges in the result list to posi- 
tions and the probability of relevance used in the experiment 
{wide range). 
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0.69 
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0.10 
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not exist 


0.00 



Table 2: The mapping from ranges in the result list to po- 
sitions and the probability of relevance (finer range) . 



avoid polluting the affinity pool with almost surely irrelevant 
documents in the low-similarity case, we used a threshold 
to eliminate poor matches. The threshold determined how 
many queries were expanded - the lower the threshold, the 
more queries are expanded. 

For the 50 TREC-5 queries, when each was compared 
against 109 other queries (for a total of 5450 comparisons), 
we observed the similarity measure to range between 0 and 
0.56. Theoretically it could range up to 1, but this happened 
only when comparing a query against itself. We ran test runs 
at two different thresholds, 0.02$ and 0.012, chosen because 
of the number of queries that would be augmented (we didn't 
want too few, or too many). These thresholds yielded 23 
and 35 queries (out of the 50 TREC-5 queries) respectively 
which have affinity pools. The effects of these thresholds on 
recall-precision and average precision are described in the 
following section. 

3 Discussion of Results 

Every query in top-document feedback had an affinity pool 
and so feedback was used for all 50 TREC-5 ad hoc queries. 
Top-document feedback showed an improvement of 7.5% 
over the baseline in average precision (Table 3) and an im- 
provement of 13.5% in the total number of documents re- 
trieved. Compared to the baseline, precision was higher for 
top-document feedback at all recall values except 0 (Fig- 
ure 2). However, precision at top 5 and top 10 documents 
suffered, with a drop of 5% on average. Precision at all 
other points up to the 100 document precision level showed 
an improvement over the baseline (Figure 3). 

Unlike in top-document feedback, some TREC-5 topics 
had no similar past queries to create affinity pools in past- 
queries feedback. For these topics, the baseline results were 
used. Though we saw improvement of precision at almost 
all recall levels for all queries (Figure 2), the actual effect 
of past-queries feedback is best revealed when only the re- 
sults from queries which have affinity pools are compared 
(Figure 4). With only the queries that were augmented by 
past-queries feedback, precision improved at all recall levels. 

As expected, top-document precision suffered with the 
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Figure 2: RecaJ I- Precision for baseline, top-document and past-queries feedback methods for all TREC-5 queries. 
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Figure 3. Dotuuipiu-Precision at top 100 documents for baseline, top-document and past-queries feedback methods for the 
23 TREC-5 augmented queries only. 
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Table 3: Average and exact precision for baseline, top-document and past-queries feedback methods. 
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Figure 4: Recall- Precision for baseline, top-document and past- queries feedback methods for the 23 TREC-5 augmented 
queries only. 



top-document feedback method. However, this was not the 
case for past-queries feedback. Precision at top 5 and top 10 
documents in past-queries feedback improved 6% on average 
(Figure 3). 

T&ble 3 shows the average precision and exact precision 
for the baseline, top-document feedback and past-queries 
feedback methods. For all 50 TREC-5 ad-hoc queries, past- 
queries feedback showed an average precision improvement 
of 15 5% over the baseline, an improvement of 7.6% over 
top-document feedback. 

Comparing average precision for the expanded queries 
only (23 of the 50) t past-queries feedback showed an im- 
provement of 38.3% over the baseline and an improvement 
of 12.1% over top-document feedback. The difference be- 
tween this and the measures for all queries is due to the fact 
that the baseline query result was used when an affinity pool 
could not be found for a query in the past-queries feedback 
method. 

We experimented with four affinity pool thresholds. Two 
were used to demonstrate the effect of thresholds on the av- 
erage precision and recall-precision curve. 6 The tighter 
the threshold, the fewer queries with affinity pools. When 
the threshold was 0 025, 23 topics had affinity pools. When 
the threshold was 0.012, 35 topics had affinity pools. Aver- 
age precision and exact precision improves as the threshold 
tightens. For example, average precision improved 24.0% 

*The results shown above are from the 0.025 threshold. 



when the threshold was increased from 0.012 to 0.025. 

Figure 5 shows the recaU-precision curve for the two 
thresholds. Similar to average precision, precision was im- 
proved at all recall value when the threshold was increased. 

4 Related Wor* 

Raghavan, et al.|12] reported on efforts to reuse past opti- 
mal queries for either of two purposes: to short-circuit query 
evaluation by matching a user query to one that had al- 
ready been evaluated, and to jump-start optimal query for- 
mation using a steepest descent method by using the closest 
matching past query as the initial query vector. To frame 
this effort, they presented a detailed analysis of similarity 
measures between queries. They concluded that the best 
similarity metric between queries was obtained by compar- 
ing query result vectors rather than treating the queries as 
term- vectors. They presented distance and similarity algo- 
rithms that used the relevant and non-relevant document 
overlap and ordering relations between query results. Us- 
ing the Cran field 1400 collection they attempted to validate 
that the use of the best match past query as the initial query 
in a steepest descent query refinement method reduced the 
number of iterations needed to reach an optimal query. 

Our effort differed from Raghavan 's in several areas. First, 
the goal of the their experimental effort was to accelerate 
formation of optima) queries in the presence of relevance 
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Figure 5: Recall-Precision for past-queries feedback methods for augmented queries only using various thresholds. 



judgments. Our goal was to improve the effectiveness of a 
user-supplied query by identifying key terms from poten- 
tially relevant documents from past queries. 

Both Raghavan's and our approaches used an inter-query 
similarity metric based on result-set overlap. Raghavan, et 
al, showed how this was better than other methods. The 
Raghavan metric used in the experiments relied on rele- 
vant document overlap between the two result lists, but we 
expressly avoided using relevant documents since they are 
rarely available in practice and so we used common docu- 
ment overlap. The assumption we made was that the prob- 
ability of two queries being similar is increased if they share 
any retrieved documents in common. 

Most significantly, we used the fact that the beginning of 
a result list is more likely to include a relevant document by 
weighting matches by their position in the result list. The 
weights were empirically derived probabilities of different 
result list ranges to contain relevant documents. 

T.L. Brauen(2] reported on a set of experiments using the 
Cranfield collection (424 documents, 155 queries) that used 
past queries to improve the effectiveness of future queries. 
The approach he used was to select a handful of relevant 
documents from each past query and modify these relevant 
documents' vectors by the query vector that selected them 
This had the effect of injecting the searcher's understanding 
of the document into the document's concept vector. Future 
queries were executed against the modified document vec- 
tors. In evaluating results, Brauen distinguished between 
queries that had other similar past queries and queries that 
did not. He reported reasonable recall /precision, average 
recall and average precision gains for this method. 

Our approach differed from Brauen 's effort in several 
ways First, Brauen used relevance judgments to decide 
which documents' vectors to mod try whereas our effort did 
not rely on relevance judgments at all (except for evalua- 
tion). Second, Brauen '5 method modified the underlying 
corpus index, whereas our method modified just queries. 
Our computation over past queries need not involve the 



database system's index at all, other than to query it. 7 Fi- 
nally, Brauen' query-similarity metric compared the queries 
by comparing their term -vectors, we compared them by their 
result-lists. 

Worona (13J reported on experiments with the Cranfield 
collection (424 documents, 155 queries) that retrieved doc- 
uments by first clustering queries. The objective was to 
evaluate cluster searching with clusters generated by clus- 
tering queries not documents. In this method, query clusters 
were formed using a clustering algorithm over the concept 
vectors of each query. Then document clusters were formed 
by replacing each query cluster with the relevant documents 
from each of the queries in the cluster. Queries were first 
executed to identify the best clusters, then the clusters were 
expanded and the documents ranked against the query. This 
was compared against cluster-searching where the clusters 
were formed by document clustering, and against full search- 
ing. The results indicated that "with 'good' enough clusters 
and centroids, a clustered search need not lose a great deal 
of the recall compared with a full search.' 1 

Worona's effort and goals differed substantively from ours 
in that he evaluated cluster- searching in order to optimize 
query execution speed without loss of search effectiveness. 
Conversely, our use of past queries was strictly to improve 
effectiveness without concern for execution speed. However, 
it is worth considering his efforts in tight of our model of 
query expansion. His formation of a cluster is analogous 
to our formation of an affinity pool In generating clusters, 
queries were treated as term- vectors and compared with a 
cosine measure. 6 Additionally, the two approaches differed 
in the application of the clusters: his method attempted to 
identify a cluster of documents to be retrieved, whereas our 
method attempted to identify an affinity pool from which 

7 Some thought is needed to determine how beat to utilize a method 
that modified the index, such its Brauen's. in the face of heavily up- 
dated text databases. 

"Raghavan, et at. [12] showed haw the cosine comparison method 
was inappropriate for comparing queries- 
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expanded terms coutd be extracted and used to refine the 
query. Finally, his method of cluster formation relied on rel- 
evant documents, but ours did not. The common thread is 
that queries can be used to identify document clusters. 

A common difference between these other studies and 
ours was the difference in test collection characteristics. The 
other studies all used some form of the Cranfield collection 
whereas we used the TREC-5 data collection and TREC- 
3, 4, and 5 ad-hoc queries. Cranfield is very topic specific 
(aeronautics) whereas the TREC collection is not; the Cran- 
field documents are bibliographic references with abstracts 
whereas TREC is full-text. The TREC collection is about 
3 orders of magnitude larger than Cranfield. However, the 
number of queries used was very similar (Brauen used 155, 
Worona used 155, Raghavan used 222, and we used 110). 

5 Conclusions and Future Work 

The goal of our effort was to determine whether past queries 
could be used as an additional source of evidence to improve 
automatic query expansion for a user- provided query. This 
goal was formulated as one possible solution co the problem 
of small user queries in the presence of very large heteroge- 
neous databases. The results are encouraging. 

It appears that there is implicit relevance information 
in past queries that can improve automatic feedback and 
that this information was tapped by the affinity pool cre- 
ation algorithm and inter-query similarity metric. Auto- 
matic past queries feedback outperformed automatic top 
document feedback and actually improved top document 
precision. Consistent with the literature (e.g., [4]), auto- 
matte top document feedback was nominally better than the 
baseline and slightly hurt top document precision. 

When- we compared only the 23 queries for which past- 
queries expansion was performed, average precision for past- 
queries feedback was 38% over the baseline and 12% bet- 
ter than top-document feedback. Past queries feedback im- 
proved the top 5 and top 10 document precision levels by 
about 6%. The full 50 automatic past-queries feedback 
run (using the baseline query for the 27 queries for which 
past* queries feedback could not be performed) showed a 
15.5% improvement over the baseline in average precision 
and 7.6% over top-document feedback when all queries were 
compared. Total recall improved slightly for past- queries 
feedback against the baseline. 

Interestingly, the performance of the baseline queries for 
which past-queries feedback could be executed was worse 
than the complete baseline run compared over average pre- 
cision. By subjective inspection, at least 6 of these queries 
were vague or abstract, were very poor performers in the 
baseline, and had higher improvement by past queries feed- 
back than the other queries. We're tempted to speculate 
that for difficult queries, the result lists of similar other 
queries may act as a noise filter for top-document feedback, 
but the numbers are too small to say. 

Finally, the use of a threshold for past-queries feedback 
to control when the technique is applied involved some cal- 
ibration, but appeared to be useful. The runs indicated an 
inverse relationship between the threshold and the effective- 
ness of the technique. For the two points shown here, a 
higher threshold reduced the number of queries that were 
expanded, but increased average precision by 24%. 

The limitations of this effort are the small number of 
queries tested and the runtime cost of the implementation 
that we used. Also, we are unsure of how the empirically 
derived probability of relevance data holds across different 



collections and query sets. Subject to validation on a larger 
query set, we're very interested in improving both the algo- 
rithm and the implementation, and in introducing additional 
sources of evidence to feedback in general. 

As search activities are viewed less as one-shot events and 
more as milestones along the road to satisfying long-term 
information needs, useful patterns will become evident. In 
most on-line systems with high throughput, anything that 
increases the cost of a single search is viewed with a critical 
eye. Augmentation techniques generally increase the cost of 
a search by more than 2X (because the search is executed 
twice), and cumulative query cost translates directly into 
hardware cost or response time. We're particularly intrigued 
by such modeling and query expansion when it can take 
place off the search engine server. 
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